
Conversation

@andrewor14
Contributor

This is a step towards consolidating SQLContext and HiveContext.

This patch extends the existing Catalog API added in #10982 with methods for handling table partitions. A partition is identified by a `PartitionSpec`, which is simply a `Map[String, String]` from partition column names to values. Nothing uses the `Catalog` yet, but its API is now more or less complete, and an in-memory implementation is fully tested.

About 200 of the added lines are test code.
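As a rough illustration of the `PartitionSpec` alias described above (the type alias is from the patch; the surrounding object and the path-style rendering are a hypothetical sketch, not the merged API):

```scala
object PartitionSpecExample {
  // From the patch: a partition is identified by a map from
  // partition column names to their string values.
  type PartitionSpec = Map[String, String]

  def main(args: Array[String]): Unit = {
    val spec: PartitionSpec = Map("year" -> "2016", "month" -> "02")
    // Render the spec in the familiar path-like form used by Hive-style layouts.
    println(spec.map { case (col, value) => s"$col=$value" }.mkString("/"))
  }
}
```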

Andrew Or added 4 commits February 3, 2016 14:28
These are a subset of the public interfaces exposed by Hive. This commit just adds the skeleton without implementing any of them.
@andrewor14 andrewor14 changed the title [SPARK-13079] Extend Catalog API + implement InMemoryCatalog [SPARK-13079] [SQL] Extend and implement InMemoryCatalog Feb 4, 2016
@andrewor14
Contributor Author

retest this please



object Catalog {
type PartitionSpec = Map[String, String]
Contributor
Need to document that this is a mapping from column names to values.
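One way to address this comment (a sketch of possible Scaladoc for the alias, not the exact wording that was merged):

```scala
object Catalog {
  /**
   * Identifies a table partition as a mapping from partition column
   * names to their values, e.g. Map("a" -> "1", "b" -> "2") for a
   * table partitioned by columns a and b.
   */
  type PartitionSpec = Map[String, String]
}
```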

@rxin
Contributor

rxin commented Feb 4, 2016

LGTM otherwise. Feel free to merge this one and address issues in your next pull request.

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50728 has finished for PR 11069 at commit 1b40002.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Feb 4, 2016

I'm going to merge this.

@asfgit asfgit closed this in a648311 Feb 4, 2016
@andrewor14 andrewor14 deleted the catalog branch February 4, 2016 03:50
asfgit pushed a commit that referenced this pull request Feb 4, 2016
This patch incorporates review feedback from #11069, which is already merged.

Author: Andrew Or <andrew@databricks.com>

Closes #11080 from andrewor14/catalog-follow-ups.
asfgit pushed a commit that referenced this pull request Feb 21, 2016
## What changes were proposed in this pull request?

This is a step towards merging `SQLContext` and `HiveContext`. A new internal Catalog API was introduced in #10982 and extended in #11069. This patch introduces an implementation of this API using `HiveClient`, an existing interface to Hive. It also extends `HiveClient` with additional calls to Hive that are needed to complete the catalog implementation.

*Where should I start reviewing?* The new catalog introduced is `HiveCatalog`. This class is relatively simple because it just calls `HiveClientImpl`, where most of the new logic is. I would not start with `HiveClient`, `HiveQl`, or `HiveMetastoreCatalog`, which are modified mainly because of a refactor.

*Why is this patch so big?* I had to refactor `HiveClient` to remove an intermediate representation of databases, tables, partitions, etc. After this refactor, `CatalogTable` converts directly to and from `HiveTable` (and likewise for the other entities). Otherwise we would have to first convert `CatalogTable` to the intermediate representation and then convert that to `HiveTable`, which is messy.
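The shape of that refactor can be sketched as follows; the case classes and field names here are illustrative assumptions, not the actual `CatalogTable`/`HiveTable` definitions:

```scala
// Illustrative stand-ins for the real classes (fields are assumed).
case class CatalogTable(name: String, properties: Map[String, String])
case class HiveTable(name: String, properties: Map[String, String])

object TableConversions {
  // Direct conversion in each direction, with no intermediate
  // representation in between.
  def toHiveTable(t: CatalogTable): HiveTable =
    HiveTable(t.name, t.properties)

  def fromHiveTable(h: HiveTable): CatalogTable =
    CatalogTable(h.name, h.properties)
}
```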

The new class hierarchy is as follows:
```
org.apache.spark.sql.catalyst.catalog.Catalog
  - org.apache.spark.sql.catalyst.catalog.InMemoryCatalog
  - org.apache.spark.sql.hive.HiveCatalog
```
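A minimal sketch of that hierarchy (the class names come from the patch; the two methods shown are assumptions for illustration — the real interface covers databases, tables, partitions, and functions):

```scala
import scala.collection.mutable

trait Catalog {
  def createDatabase(name: String): Unit
  def listDatabases(): Seq[String]
}

// Pure in-memory implementation, convenient for tests.
class InMemoryCatalog extends Catalog {
  private val databases = mutable.LinkedHashSet[String]("default")
  override def createDatabase(name: String): Unit = databases += name
  override def listDatabases(): Seq[String] = databases.toSeq.sorted
}

// A HiveCatalog would implement the same trait by delegating
// its calls to HiveClientImpl (omitted here).
```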

Note that, as of this patch, none of these classes is used anywhere yet. That will come in a future patch before the Spark 2.0 release.

## How was this patch tested?
All existing unit tests, plus a new `HiveCatalogSuite` that extends `CatalogTestCases`.

Author: Andrew Or <andrew@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #11293 from rxin/hive-catalog.